After you graduated, you started to work for one of the best firms in the country. You were hired because you have data analysis skills with \(R\). During your first week, your manager comes to your office, gives you the following data set, and asks you to “analyze the hell out of this data” (his words, not mine). Mainly, he wants you to build a linear model to predict executive salaries. But you know you can do much more! Analyze the given data and create a report like your job depends on it.
The data are stored as a .txt file under week 10 and consist of 11 variables.
| id | Y | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 11.4436 | 12 | 15 | 1 | 240 | 170 | 1 | 44 | 5 | 0 | 21 |
| 2 | 11.7753 | 25 | 14 | 1 | 510 | 160 | 1 | 53 | 9 | 0 | 28 |
| 3 | 11.3874 | 20 | 14 | 0 | 370 | 170 | 1 | 56 | 5 | 0 | 26 |
| 4 | 11.2172 | 3 | 19 | 1 | 170 | 170 | 1 | 26 | 9 | 0 | 24 |
| 5 | 11.6553 | 19 | 12 | 1 | 520 | 150 | 1 | 43 | 7 | 0 | 27 |
| 6 | 11.1619 | 14 | 13 | 0 | 420 | 160 | 1 | 53 | 9 | 0 | 27 |
| Variable | Name | Description |
|---|---|---|
| y | salary | Salary of executive |
| x1 | experience | Experience (in years) |
| x2 | education | Education (in years) |
| x3 | gender | Gender (1 if male 0 if female) |
| x4 | emps_sup | Number of employees supervised |
| x5 | assets | Corporate assets (in millions of USD) |
| x6 | board_mb | Board member (1 if yes, 0 if no) |
| x7 | age | Age (in years) |
| x8 | profit | Company profits (in millions of USD) |
| x9 | int_res | Has international responsibility (1 if yes, 0 if no) |
| x10 | sales | Company’s total sales (in millions of USD) |
| salary | experience | education | gender | emps_sup | assets | board_mb | age | profit | int_res | sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 11.4436 | 12 | 15 | 1 | 240 | 170 | 1 | 44 | 5 | 0 | 21 |
| 11.7753 | 25 | 14 | 1 | 510 | 160 | 1 | 53 | 9 | 0 | 28 |
| 11.3874 | 20 | 14 | 0 | 370 | 170 | 1 | 56 | 5 | 0 | 26 |
| 11.2172 | 3 | 19 | 1 | 170 | 170 | 1 | 26 | 9 | 0 | 24 |
| 11.6553 | 19 | 12 | 1 | 520 | 150 | 1 | 43 | 7 | 0 | 27 |
| 11.1619 | 14 | 13 | 0 | 420 | 160 | 1 | 53 | 9 | 0 | 27 |
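The renaming step between the two tables above can be sketched as follows. This is only an illustration: it rebuilds the six rows shown here rather than reading the real .txt file (whose name is not given), and storing the three 0/1 indicators as factors is an assumption based on how they are summarized later.

```r
# Six sample rows copied from the first table; the full file has 100 rows.
sample_rows <- "id Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 11.4436 12 15 1 240 170 1 44 5 0 21
2 11.7753 25 14 1 510 160 1 53 9 0 28
3 11.3874 20 14 0 370 170 1 56 5 0 26
4 11.2172 3 19 1 170 170 1 26 9 0 24
5 11.6553 19 12 1 520 150 1 43 7 0 27
6 11.1619 14 13 0 420 160 1 53 9 0 27"

raw <- read.table(text = sample_rows, header = TRUE)

# Drop the id column and apply the descriptive names from the variable table.
df <- raw[, -1]
names(df) <- c("salary", "experience", "education", "gender", "emps_sup",
               "assets", "board_mb", "age", "profit", "int_res", "sales")

# Assumption: treat the 0/1 indicators as factors.
indicator_cols <- c("gender", "board_mb", "int_res")
df[indicator_cols] <- lapply(df[indicator_cols], factor, levels = c(0, 1))

str(df)
```

With the real file, the `read.table(text = ...)` call would be replaced by a call that points at the file's actual path.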
In console
library(ISLR)
View(df)
?df
Skim summary statistics
n obs: 100
n variables: 11
Variable type: factor

| variable | missing | complete | n | n_unique | top_counts | ordered |
|---|---|---|---|---|---|---|
| board_mb | 0 | 100 | 100 | 2 | 0: 51, 1: 49, NA: 0 | FALSE |
| gender | 0 | 100 | 100 | 2 | 1: 66, 0: 34, NA: 0 | FALSE |
| int_res | 0 | 100 | 100 | 2 | 0: 82, 1: 18, NA: 0 | FALSE |

Variable type: integer

| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 100 | 100 | 42.84 | 9.07 | 23 | 37 | 42.5 | 49.25 | 64 | ▃▃▇▇▆▆▃▂ |
| assets | 0 | 100 | 100 | 175.1 | 15.41 | 150 | 160 | 180 | 190 | 200 | ▃▇▁▆▇▁▇▃ |
| education | 0 | 100 | 100 | 16.02 | 2.3 | 12 | 14 | 16 | 18 | 20 | ▇▃▅▅▆▆▆▁ |
| emps_sup | 0 | 100 | 100 | 340.1 | 167.18 | 60 | 187.5 | 360 | 492.5 | 600 | ▇▆▅▆▇▆▇▇ |
| experience | 0 | 100 | 100 | 13.08 | 7.34 | 1 | 7.75 | 13 | 20 | 26 | ▇▃▆▇▃▃▇▅ |
| profit | 0 | 100 | 100 | 7.7 | 1.55 | 5 | 6 | 8 | 9 | 10 | ▂▇▁▇▆▁▇▆ |
| sales | 0 | 100 | 100 | 24.83 | 2.74 | 20 | 23 | 25 | 27 | 30 | ▃▃▃▇▂▃▂▃ |

Variable type: numeric

| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| salary | 0 | 100 | 100 | 11.46 | 0.26 | 10.66 | 11.28 | 11.46 | 11.61 | 12.06 | ▁▁▃▇▇▇▃▂ |
The minimum value of age = 23, assets = 150, education = 12, emps_sup = 60, experience = 1, profit = 5, sales = 20, and salary = 10.66.
The maximum value of age = 64, assets = 200, education = 20, emps_sup = 600, experience = 26, profit = 10, sales = 30, and salary = 12.06.
The mean value of age = 42.84, assets = 175.1, education = 16.02, emps_sup = 340.1, experience = 13.08, profit = 7.7, sales = 24.83, salary = 11.46.
The standard deviation of age = 9.07, assets = 15.41, education = 2.3, emps_sup = 167.18, experience = 7.34, profit = 1.55, sales = 2.74, salary = 0.26.
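As a rough sketch of how such per-variable summaries can be computed in base R, the block below applies `min`, `max`, `mean`, and `sd` to the numeric columns of the six sample rows shown earlier. Because it uses only those six rows, its numbers will not match the full 100-row statistics; the object names `num` and `stats` are illustrative.

```r
# Numeric columns of the six sample rows (the full data set has 100 rows).
sample_rows <- "salary experience education emps_sup assets age profit sales
11.4436 12 15 240 170 44 5 21
11.7753 25 14 510 160 53 9 28
11.3874 20 14 370 170 56 5 26
11.2172 3 19 170 170 26 9 24
11.6553 19 12 520 150 43 7 27
11.1619 14 13 420 160 53 9 27"

num <- read.table(text = sample_rows, header = TRUE)

# One row per variable, one column per statistic.
stats <- t(sapply(num, function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
}))

print(round(stats, 2))

# Sanity check: every minimum should be no larger than its maximum.
stopifnot(all(stats[, "min"] <= stats[, "max"]))
```

A check like the last line is a cheap way to catch transposed values when copying statistics into a report.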
From these histograms we can see that:
# A tibble: 8 x 2
Variable Distribution
<chr> <chr>
1 age Normal
2 assets Random
3 education Mostly Uniform
4 emps_sup Mostly Uniform
5 experience Random
6 profit Random
7 sales Skewed Right
8 salary Skewed Left
* age (Normal): more executives have an age between 33.77 and 51.91 (within one standard deviation of the mean) than outside this range.
* assets (Random): the value of corporate assets shows no clear pattern across the executives in the data.
* education (Mostly Uniform): executives are spread roughly evenly across the possible years of education.
* emps_sup (Mostly Uniform): the number of employees supervised is spread roughly evenly across its range.
* experience (Random): the number of years of experience shows no clear pattern across the executives in the data.
* profit (Random): the value of company profit shows no clear pattern across the executives in the data.
* sales (Skewed Right): most companies have total sales of \$25 million or below.
* salary (Skewed Left): most executive salaries are \$11.46 million or above.
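The age range quoted in the first bullet is the mean plus or minus one standard deviation, using the summary statistics reported above. A minimal sketch of that arithmetic, with the share of a normal distribution that falls in such a range added for reference (the variable names are illustrative):

```r
# Summary statistics for age, copied from the skim output.
age_mean <- 42.84
age_sd <- 9.07

# One standard deviation on either side of the mean.
interval <- c(lower = age_mean - age_sd, upper = age_mean + age_sd)
print(interval)  # lower = 33.77, upper = 51.91

# For an exactly normal variable, about 68% of values fall in this range.
share_within_1sd <- pnorm(1) - pnorm(-1)
print(round(share_within_1sd, 3))  # 0.683
```

The 68% figure holds only if age is truly normal; for the sample it is an approximation.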
Visual Descriptions:
There are more male executives (1) than female executives (0): 66 versus 34.
Board members (1) and non-board members (0) are almost evenly split, with slightly more non-board members (51 versus 49).
Executives without international responsibility (0) far outnumber those with international responsibility (1): 82 versus 18.
Far fewer executives have 20 years of education than have between 12.5 and just under 20 years of education.
The most common ages are between 30 and 45 years.
Executives aged 60 and older are the least common, in contrast to those between 30 and 45 years of age.
The mean salary for males (1) is higher than the mean salary for females (0).
There is a positive linear relationship between a person’s experience (in years) and their salary.
There is a positive linear relationship between a person’s age and their salary.
The mean salaries for people with international responsibility (1) and without international responsibility (0) are approximately equal.
Using a linear model with parallel slopes, we can predict an executive’s salary (in millions) based on their experience, education, gender, and assets.
Experience, education, gender, and assets are all significantly and positively associated with salary, so all four are included in our linear model.
\(\hat{Salary} = 10.14 + 0.027 \cdot experience + 0.022 \cdot education + 0.003 \cdot assets + 0.185 \cdot 1_{Male}(x)\)
Male executive model: \(\hat{Salary} = 10.325 + 0.027 \cdot experience + 0.022 \cdot education + 0.003 \cdot assets\)
Female executive model: \(\hat{Salary} = 10.14 + 0.027 \cdot experience + 0.022 \cdot education + 0.003 \cdot assets\)
In our base model, we could extrapolate that a female executive with no experience, no education, and no corporate assets has a salary of \$10.14 million. Holding the other variables fixed, each extra year of experience and of education is associated with a salary increase of \$27,000 and \$22,000 respectively. Male executives, on average, make \$185,000 more than their female counterparts with similar experience, education, and assets.
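To make the parallel slopes model concrete, here is a small sketch that plugs values into the fitted equation. The coefficients are copied from the equation above; the function name `predict_salary` is hypothetical, and the inputs are the first and third sample rows from the data table.

```r
# Coefficients copied from the fitted parallel slopes equation.
coefs <- c(intercept = 10.14, experience = 0.027, education = 0.022,
           assets = 0.003, male = 0.185)

# Hypothetical helper: predicted salary for given inputs (male is 1 or 0).
predict_salary <- function(experience, education, assets, male) {
  unname(coefs["intercept"] +
           coefs["experience"] * experience +
           coefs["education"] * education +
           coefs["assets"] * assets +
           coefs["male"] * male)
}

# Row 1: male, 12 years of experience, 15 years of education, assets of 170.
pred_row1 <- predict_salary(12, 15, 170, male = 1)
print(round(pred_row1, 3))  # 11.489 (observed salary: 11.4436)

# Row 3: female, 20 years of experience, 14 years of education, assets of 170.
pred_row3 <- predict_salary(20, 14, 170, male = 0)
print(round(pred_row3, 3))  # 11.498 (observed salary: 11.3874)

# Same profile, different gender: the gap is the gender coefficient.
gap <- predict_salary(12, 15, 170, male = 1) - predict_salary(12, 15, 170, male = 0)
print(round(gap, 3))  # 0.185
```

With the full data set, coefficients of this kind would typically come from a call along the lines of `lm(salary ~ experience + education + assets + gender, data = df)`, although the exact call used for this report is not shown.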
Using an interaction model, we can use both the experience and gender variables to see how they interact with each other in terms of salary.
\(\hat{Salary} = 11 + 0.026 \cdot experience + 0.174 \cdot 1_{Male}(x) + 0.002 \cdot experience \cdot 1_{Male}(x)\)
Female experience model: \(\hat{Salary}_F = 11 + 0.026 \cdot experience\)
Male experience model: \(\hat{Salary}_M = 11.174 + 0.028 \cdot experience\)
As we can see from the models, male executives have both a higher base salary than female executives and a marginally larger increase in salary with each additional year of experience. However, as evidenced by the graph, this interaction between experience and gender is negligible, as both genders see their pay increase at nearly the same rate.
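The size of the interaction can be checked directly from the fitted coefficients. The sketch below uses the values from the interaction equation above (the names `b0`, `b_exp`, `b_male`, `b_int`, and `predict_interaction` are illustrative) and shows how the male and female lines differ at a few experience levels.

```r
# Coefficients copied from the interaction model equation.
b0 <- 11        # intercept (female baseline)
b_exp <- 0.026  # slope for experience (female)
b_male <- 0.174 # intercept shift for male executives
b_int <- 0.002  # extra slope for male executives

# Hypothetical helper: predicted salary under the interaction model.
predict_interaction <- function(experience, male) {
  b0 + b_exp * experience + b_male * male + b_int * experience * male
}

# Implied male intercept and slope.
male_intercept <- b0 + b_male
male_slope <- b_exp + b_int
print(c(intercept = male_intercept, slope = male_slope))  # 11.174, 0.028

# Male-minus-female gap at 0, 10, and 20 years of experience.
years <- c(0, 10, 20)
gender_gap <- predict_interaction(years, 1) - predict_interaction(years, 0)
print(round(gender_gap, 3))  # 0.174 0.194 0.214
```

The gap grows by only 0.002 per year of experience, which is small next to the 0.174 difference in intercepts.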